BioData Mining — Latest Matching Preprints

1

Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv

Top 0.1%

3.6%

Show abstract

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.

2

A Multimodal Clinical Dataset of Early Adversity, Placement History, and Prenatal Exposures in Adopted and Foster Care Children

Sullivan, C. R.; Anderson, S.; Caola, L.; Rawstern, T.; Loleng, J.; Roghair, J.; Dastin-Van Rijn, E.; Gustafson, K.; Randolph, A.

2026-05-29 pediatrics 10.64898/2026.05.27.26354273 medRxiv

Top 0.1%

3.6%

Show abstract

We assembled a multimodal clinical dataset describing demographics, placement history, prenatal substance exposure (PSE), birth characteristics, adverse childhood experiences (ACEs), International Classification of Diseases (ICD) diagnoses, and laboratory results for 3,685+ pediatric patients evaluated between 2014 and 2024 at the University of Minnesotas Adoption Medicine Clinic (AMC). Data were curated from electronic medical records through a combined manual and automated extraction protocol using a standardized operating procedure. The resulting dataset integrates structured EMR fields including neuropsychological, laboratory, and diagnostic information with manually pulled fields of ACE scores, PSE history, and placement history. We provide an overview of the population represented and describe the datasets structure, variable definitions, and validation procedures. This resource enables investigations into how early adversity impacts medical and developmental outcomes, and provides one of the largest standardized clinical placement history, PSE, and ACE datasets in an adoption and foster care pediatric population.

3

Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics 10.64898/2026.05.05.723102 medRxiv

Top 0.2%

2.4%

Show abstract

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patients specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variants text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). Across the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity [≥] 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%) but rose to 100% with added clinical context. At the more stringent {tau} = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%. The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.

4

Incidence and Severity of Carboplatin-Associated Hearing Loss in Children with Cancer Assessed by the SIOP 2012 Ototoxicity Criteria

Chawla, A.; Carter, S.; Wood, A.; Staffieri, S.; Dodgshun, A.; Eisenstat, D.; Sullivan, M.

2026-05-30 pediatrics 10.64898/2026.05.21.26353442 medRxiv

Top 0.2%

2.1%

Show abstract

Background: Platinum-based chemotherapy is known to cause severe and debilitating hearing loss, but unlike cisplatin, the true incidence of carboplatin-induced hearing loss remains unclear. We evaluated functional hearing outcomes in children receiving carboplatin to determine the incidence and severity of ototoxicity. Procedure: We identified a large cohort of children with cancer treated with carboplatin and graded their audiograms using the SIOP ototoxicity scale. Patients with inadequate audiological follow-up, prior hearing loss, or exposure to cisplatin were excluded. Fishers exact test, logistic regression, and ROC analyses were performed to investigate associations of demographic, treatment, and exposure-related risk factors with incidence of hearing loss. Results: 200 patients were included, all of whom had been treated with carboplatin. Only nine (4.5%) patients developed clinically significant hearing loss (SIOP grade [≥]2). Younger age at first exposure to carboplatin was the only significant predictor of hearing loss (OR = 0.7888, p=0.0241). Age [≤]28 months was significantly associated with hearing loss (OR 12.37, p=0.0042). No other risk factors or exposures were statistically significant. Conclusions: Clinically significant carboplatin-associated hearing loss was uncommon (incidence 4.5%). We show that young age is the single-most important risk factor for hearing loss; of nine children who developed hearing loss, eight were aged [≤]28 months. Children below this age have twelve-fold higher odds of developing hearing loss compared to those above this age (OR 12.37). These findings will allow physicians to provide more appropriate counselling to families regarding ototoxic risk and support intensified hearing surveillance in young children.

5

Structured large language model extraction of clinical factors from electronic health record text supports scalable psychiatric severity prediction

Stephenson, C.; Camassa, A.; Wagner, M.; Shirazi, A. H.; Alavi, N.; Omrani, M.

2026-05-13 psychiatry and clinical psychology 10.64898/2026.05.11.26352839 medRxiv

Top 0.2%

2.1%

Show abstract

BackgroundMental health systems face escalating demand that exceeds clinician capacity, making accurate severity-based triage a critical bottleneck. Severity assessment guides treatment intensity, resource allocation, and risk management, yet most clinically relevant information remains embedded in unstructured electronic health record (EHR) narratives, limiting its utility for scalable decision support. ObjectivesThis study evaluates whether a single large language model (LLM) can autonomously extract clinical factors from psychiatric EHR narratives, derive predictive weights from those factors, and use the resulting structured representation to predict clinician-implied severity at scale. MethodsFrom a Mayo Clinic repository of more than 2.7 million encounters, 15,000 de-identified psychiatric notes were sampled into a 5,000-patient discovery cohort and a 10,000-patient replication cohort. The same LLM (Llama 3 8B Instruct) extracted 17 background clinical factors and 3 treatment-action factors from each note. Severity reference labels were derived from the treatment-action factors using pre-specified clinical criteria. The LLM independently derived two factor-weight dictionaries from the discovery cohort: one capturing risk-oriented predictors of severe presentations and one capturing protective predictors. Five weighting conditions were then evaluated against the severity labels: the two LLM-derived dictionaries, two controls (LLM-derived variables with randomized weights; clinically irrelevant variables with arbitrary weights), and an unweighted zero-shot baseline. Performance was assessed across 928 valid iterations in the replication cohort. ResultsLLM-derived structured conditions significantly outperformed all controls and the baseline, with statistically equivalent performance between the two structured conditions. Improvements in precision and recall were balanced, indicating gains in discriminative capacity rather than threshold shifts. The variables and weights the LLM derived as predictors of severe presentations aligned closely with established clinical determinants of psychiatric severity. ConclusionA single LLM can derive clinically meaningful factor weights from unstructured EHR narratives and use them to predict psychiatric severity at scale, supporting a viable path toward interpretable, scalable triage in resource-constrained mental health systems.

6

CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research

Upadhayaya, R.; Pradhan, M. M.; Metzger, V. T.; Malec, S. A.

2026-05-12 bioinformatics 10.64898/2026.05.07.723601 medRxiv

Top 0.2%

1.8%

Show abstract

BackgroundVariable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale. MethodsWe developed CausalKnowledgeTrace, a Python-based computational framework with Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system implements a six-stage analysis pipeline using NetworkX for graph operations, including graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection. ResultsAnalysis of the hypertension-Alzheimers relationship across three degree neighborhoods (1-3) demonstrated systematic scaling of causal complexity: 361-866 variables, 429-1,442 relationships, with graph densities of 0.0033-0.0019. The analysis revealed complex cyclic structures with 54-606 baseline cycles across degree levels. Processing times ranged from 0.3-1.0 seconds for all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third degree of graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders from the causal graph. Among the key identified confounders and mediators--including obesity, oxidative stress, ischemia, and vascular diseases--all were found to have strong supporting evidence in established epidemiological and pathophysiological literature. ConclusionsCausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research. Statement of SignificanceO_ST_ABSProblem or IssueC_ST_ABSSelecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review. What is Already KnownExisting approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases. What this Paper AddsThis paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability. Who would benefit from the new knowledge in this paper?Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.

7

Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv

Top 0.3%

1.7%

Show abstract

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models performance of predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem. Author SummaryProtein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and characters related thereto, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features-only with hybrid features, among four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.

8

Evo 2 Predicts Cardiomyopathy-Associated Variants and Elucidates Their Underlying Mechanisms

kurozumi, a.; otsuka, n.; Masamichi, I.; kawakami, t.; Isagawa, T.; kodera, s.; takeda, n.

2026-05-17 genomics 10.64898/2026.05.15.725304 medRxiv

Top 0.3%

1.7%

Show abstract

BackgroundAlthough advances in next-generation sequencing have accelerated the identification of genetic variants in cardiomyopathy, interpreting variants of uncertain significance (VUS) remains a clinical challenge. Evo 2 is a high-resolution genomic artificial intelligence model capable of predicting pathogenicity across large sequence contexts and enabling mechanistic interpretation; however, its application in cardiovascular genetics is limited. Here, we evaluated the utility of Evo 2 for assessing the pathogenicity and underlying mechanisms of cardiomyopathy-associated variants. MethodsWe used Evo 2 to predict the pathogenicity of single-nucleotide variants in cardiomyopathy-related genes listed on ClinVar. We assessed the ability of the model to identify characteristic structural features in both coding and noncoding regions using internal representation such as embeddings, and to infer the molecular mechanisms of variants within these regions. ResultsEvo 2 demonstrated high predictive accuracy for pathogenicity, achieving an AUROC of 0.983 and an AUPRC of 0.915. Notably, sparse autoencoders (SAEs) from embeddings identified features corresponding to higher-order structural features, including coiled-coil and actin-binding domains characteristic of cardiomyopathy-related proteins, and accurately detected mutations known to disrupt these domains. The model recognized the binding motif of the cardiac-enriched transcription factor TBX5 with SAEs and accurately predicted a single-nucleotide polymorphism affecting TBX5 binding affinity after supervised fine-tuning. ConclusionsEvo 2 demonstrated strong performance for both predicting pathogenicity and extracting biological features of cardiomyopathy-associated variants. It may represent a powerful emerging tool for evaluating VUS in cardiovascular medicine.

9

A Beta-Binomial Model for Estimating Zero- or One-inflated Pain Trajectories

Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.

2026-05-11 bioinformatics 10.64898/2026.05.07.721507 medRxiv

Top 0.4%

1.4%

Show abstract

Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for modeling potential zero- or one-inflated pain scores while accounting for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying the patient-reported pain trajectories.

10

Making Course Structure Visible in a Multi-Instructor Graduate Genomics Course: A Course-Level Evaluation of Standardized Learning Supports

SAITOU, M.; Diblasi, C.

2026-05-08 scientific communication and education 10.64898/2026.05.06.723173 medRxiv

Top 0.5%

1.3%

Show abstract

Graduate-level genomics courses require students to integrate dense material across subfields, concepts and methods. In modular, multi-instructor courses, students may struggle because the coherence between lectures can be difficult to navigate, while the course structure may be visible to instructors. We evaluated a 2025 navigation redesign of BIO322, a graduate genomics course at the Norwegian University of Life Sciences, while preserving course content, multi-instructor teaching, modular organization and assessment framework. The redesign includes introducing a standardized self-learning guide, expanded syllabus, enriched online quiz feedback, and added support for a final group research proposal. Using anonymized course evaluation scores from 2021-2025 and aggregated learning management system access data from 2023-2025, we examined student experience and resource use. In 2025, five of six course evaluation items reached their highest observed BIO322 scores, while one, lecture-specific score remained within the previous range. The consolidated self-learning guide was accessed by nearly all students, whereas access to optional readings declined across the course sequence, despite comparatively stable page views per accessing student. These course-level findings are consistent with improved perceived navigability following the introduction of standardized learning support. However, some students continued to report difficulty identifying priorities and connections among course components, indicating that challenges in perceived course coherence remained for part of the cohort despite the redesign. Practitioner PointsO_LIMaking course structure explicit may improve students perceived navigability in multi-instructor graduate genomics courses. C_LIO_LIA centralized self-learning guide can broaden access to preparatory guidance without changing core course content or assessment. C_LIO_LIOptional learning supports may be used unevenly, so resource availability should not be assumed to translate into uniform resource access. C_LI

11

Machine learning methodology using a masked neural network for robust genetic risk score calculation from noisy and missing data

Squires, S.; Weedon, M. N.; Oram, R. A.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.18.25341725 medRxiv

Top 0.5%

1.2%

Show abstract

Purpose: Genetic risk scores (GRSs) are summaries of genetic data that can improve prediction of disease risk and progression. GRSs are increasing available but rely on high quality input data to produce good output results; with noisy or missing inputs the GRS may be inaccurate. We aimed to develop a method to produce a robust estimate of the GRS when input data is missing, noisy or both. Approach: We developed a neural network approach, named masked-MLP, for robust GRS calculation trained on a set of GRS scores calculated on clean data. The masked-MLP includes additional input data and has noise inserted during training, both which make the model more robust. Results: A GRS for type 1 diabetes (T1D) calculated on input data with 10\% of the data corrupted had a Spearman rank correlation to the clean GRS of 0.669 (0.665-0.674) while the equivalent for the masked-MLP was 0.951 (0.950-0.952). For the same data the area under the receiver operating characteristic curve for separation of T1D from population samples fell from 0.919 (0.904-0.932) to 0.808 (0.787-0.827) for the GRS while the masked-MLP fell to 0.910 (0.895-0.924). Conclusions: The masked-MLP was more robust to noise when calculating a GRS than using standard approaches. Our approach has the potential to ensure both improved research and clinical outcomes due to more reliable GRS calculation.

12

Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

Dworkis, D. A.; Stenstrom, J.; Sen, A.; Lucarelli, R. T.

2026-05-25 emergency medicine 10.64898/2026.05.22.26353904 medRxiv

Top 0.5%

1.1%

Show abstract

Background: Stroke is a time-sensitive neurological emergency in which early EMS activation and presentation to definitive care are cornerstones of effective therapy. Large language models (LLMs) are increasingly consulted by the public for medical advice, but the veracity of the guidance provided by commercially available models responding to potential stroke symptoms is not well understood. Methods: We performed a cross-model benchmarking study comparing the triage choices of three frontier LLMs (Claude Sonnet 4.6, GPT-4o, and Llama 3.3-70b-versatile) on first-person vignettes describing a unilateral arm symptom on waking, across 10 symptom descriptors, and two clinical phases (before and after a partially reassuring self-examination), with or without a clinical distractor (n=50 per condition). Results: Claude sought emergency care most often, Llama least, and GPT-4o in between, diverging most sharply in the post-examination phase where Claude called 911 in 100% of runs, Llama called for non-emergency help in 100%, and GPT-4o was symptom-dependent. A distractor shifted behavior away from emergency care in almost all conditions: calling 911 fell from 37.9% to 14.6% and waiting rose from 0% to 45.9% in the post-examination vignette. Responses were also sensitive to symptom word: weak, limp, heavy, and clumsy generated higher alarm, whereas numb, tingly, odd, strange, and weird generated less urgent responses. Conclusions: The increasing use of LLMs for medical advice has significant public health implications. Commercially available LLMs show significant model-to-model variability and framing sensitivity when confronted with potential stroke symptoms, including under-recognition of canonical CDC warning descriptors, underscoring the need for systematic benchmarking as these tools become de facto first points of contact for patients experiencing neurological emergencies.

13

Fisher information matrix computation for joint longitudinal and survival models to support clinical study design and covariate effect assessment

Fayette, L.; Brendel, K.; Mentre, F.

2026-06-01 pharmacology and therapeutics 10.64898/2026.05.28.26354340 medRxiv

Top 0.5%

1.1%

Show abstract

Joint modelling of longitudinal data using non-linear mixed effects models and time-to-event outcomes provides a suitable framework to account for informative censoring when estimating biomarker dynamics and quantifying event risk using covariates and longitudinal trajectories. Their usefulness in clinical research depends on data collection design, particularly to precisely estimate the association (link) parameter between longitudinal and survival processes. However, optimal design strategies have so far been addressed separately for longitudinal and survival endpoints and remain unexplored for joint models. We propose two Fisher Information Matrix (FIM) computation methods for joint models, relying on Monte-Carlo integration over observations combined with either Markov Chains Monte-Carlo or Adaptive Gaussian Quadrature to integrate random effects. Their accuracy is assessed against clinical trial simulations in an oncological example based on the HORIZON III study with a tumour-growth-survival model including discrete and continuous covariates. We apply these methods to quantify the impact of follow-up duration, sampling richness, sample size, and covariate distribution on parameter uncertainty and test power. In our example, longitudinal-parameter uncertainty is barely affected by follow-up duration or sampling richness, whereas survival-parameter uncertainty decreases substantially from 1-year to 2-year follow-up. The number of subjects needed (NSN) to achieve <15\% uncertainty on the link parameter is comparable for a 2-year rich design and a 3-year sparse design. Optimal covariate distributions are stable across designs and systematically improve test power, outperforming longer and richer but non-optimised designs. These FIM-based methods accurately predict uncertainty and test powers, enabling design evaluation and NSN computation for joint-model-based clinical studies.

14

Pharmacogenetic Characterization of Cytochrome P450 Genes involved in Psychotropic Medication Metabolism in a Cohort of Patients with Prader-Willi Syndrome

Moreno-Armengol, A.; Pareja, R.; Hernandez-Lazaro, A.; Capel, L.; Corripio, R.; Caixas, A.; Baena, N.

2026-05-18 pharmacology and therapeutics 10.64898/2026.05.09.26352521 medRxiv

Top 0.5%

1.1%

Show abstract

Prader-Willi syndrome (PWS) is a rare multisystemic disorder characterized by obesity, endocrine dysfunctions, and psychiatric comorbidities, which imply frequent use of psychotropic medications. They account for atypical responses to standard dosages of psychiatric drugs. Pharmacogenetics could be part of the reason for this situation, potentially offering a valuable tool for individualized treatment. This study analyzed allelic and phenotypic frequency distributions of five of the main cytochrome P450 enzymes (CYP2D6, CYP2B6, CYP2C19, CYP2C9, CYP3A4) involved in psychiatric drug metabolism in 47 patients with genetically confirmed diagnosis of PWS and compared them to reference frequencies in the general European population. Allelic frequency comparisons between the European reference population and the overall PWS cohort revealed a significant global difference for CYP2B6, with CYP2C19 and CYP2D6 showing trends toward significance. Although no global allelic differences remained significant after false discovery rate correction, post-hoc analyses consistently identified an enrichment of reduced- or non-functional alleles CYP2B619 and CYP2D610 in patients with PWS. Predicted metabolizer phenotype analyses showed a significant shift toward intermediate metabolizers of CYP3A4 in the PWS cohort, with corresponding depletion of normal metabolizers. Subgroup analyses indicated that allelic differences were more pronounced in maternal uniparental disomy and non-deletion subtypes, particularly for CYP2B6, although no significant differences were observed between PWS genetic subtypes. Overall, results imply potential differences in metabolizing activity in PWS patients, and subsequent implications in drug efficacy and tolerability. These results support the idea that pharmacogenetic testing may improve therapeutic decision-making in PWS for psychiatric treatment. Larger studies are needed to confirm these preliminary results.

15

Predicting Intensive Care Readmission Among Hospitalized Children

Arshad, A.; Carey, K. A.; Daniels, L. A.; Jani, P.; Gilbert, E.; Sanchez-Pinto, L. N.; Mayampurath, A.

2026-05-19 pediatrics 10.64898/2026.05.15.26353330 medRxiv

Top 0.5%

1.1%

Show abstract

Objective: Readmissions to the PICU are associated with increased morbidity and mortality. A prediction model that can identify children at risk of readmission at the time of transfer can allow providers to intervene and potentially improve patient outcomes. The objective of this study was to derive and validate machine learning models to predict PICU readmission at the time of transfer. Design: Retrospective observational cohort study Setting: Three quaternary care PICUs in the city of Chicago Patients: All children admitted to the PICU between 2012 and 2019. Measurements: The primary outcome was unplanned readmission to the PICU within 48 hours of transfer to the inpatient ward. Predictor variables included vital signs, patient characteristics, and laboratory results. We developed and externally validated four models to predict PICU readmission: logistic regression, elastic net, random forest, and XGBoost. Main Results: This study included 35,601 patients, with readmission rates ranging from 2.2-3.7% by site. The performance of models during internal validation was consistent at the three sites, with the area under the receiver operating characteristic (AUC) values between 0.70 and 0.73 and no difference across the four models. Model performance decreased significantly during external validation (AUCs of 0.60-0.69). The variables most important to the prediction differed at each site. Conclusion: Machine learning models for predicting readmissions to the PICU have limited generalizability. Locally derived models demonstrated modest performance in our study and could potentially inform provider decision-making if prospectively validated. Externally developed models are unlikely to perform well at predicting PICU readmissions.

16

An electrocardiogram-based machine learning model for distinguishing complete Kawasaki disease.

Nakano, T.; Saito, K.; Noda, K.; Asai, Y.; Kojima, A.; Uchida, H.; Ohira, Y.; Ito, H.; Kawada, J.-i.; Yoshikawa, T.

2026-05-06 pediatrics 10.64898/2026.04.30.26352183 medRxiv

Top 0.6%

1.0%

Show abstract

Kawasaki disease (KD) is a systemic vasculitis in young children, and early diagnosis remains challenging when clinical features are incomplete or overlap with those of other febrile illnesses. Because electrocardiography (ECG) is noninvasive and widely available, we investigated whether ECG-derived features could help distinguish complete KD from pediatric patients with fevers. We conducted a single-center retrospective study of hospitalized febrile children aged 1-8 years who underwent digital 12-lead ECG recording during the initial evaluation. Five amplitude features and six timing features extracted from the ECG were used to develop a logistic regression model to distinguish between complete KD and other febrile illnesses. The model discriminated between the KD and non-KD groups in the validation dataset. The prediction score was not significantly correlated with the age and body temperature. S-wave amplitude, the RR interval, and P-and Q-wave amplitudes were suggested to contribute to discrimination. These findings suggest that ECG-derived features may provide adjunctive information for distinguishing complete KD from other febrile illnesses. Author SummaryKawasaki disease is an inflammatory illness in young children that can lead to coronary artery complications if treatment is delayed. Early diagnosis is often difficult because its initial symptoms overlap with those of many common febrile illnesses. We investigated whether a routine 12-lead electrocardiogram (ECG), which is noninvasive, rapid, and widely available, contains information that can help distinguish complete Kawasaki disease from other febrile conditions. We retrospectively analyzed digital ECGs from hospitalized febrile children and extracted waveform amplitude and timing features. Using these features, we built a logistic regression model and evaluated it in a temporally separate validation cohort. The model distinguished patients with Kawasaki disease from patients with fever. P-, Q-, and S-wave amplitudes and the RR interval were repeatedly selected as important contributors, suggesting that both waveform morphology and heart-rate-related information may be relevant. These findings indicate that ECG-derived features may provide useful adjunctive information during the clinical assessment of complete Kawasaki disease.

17

Video-based Detection of Delirium in Hospitalized Adults

Mendu, M.; Tesh, R. A.; Pellerin, K.; Steward, G. E.; Cerda, I. H.; Williams, M.; Colman, M.; Shah, S.; Lam, A. D.; Cash, S. S.; Westover, M. B.; Kimchi, E. Y.

2026-05-13 geriatric medicine 10.64898/2026.05.11.26352902 medRxiv

Top 0.6%

0.9%

Show abstract

Delirium, a dynamic neuropsychiatric condition associated with morbidity and mortality, remains underdiagnosed due to reliance on subjective, intermittent screening tools. Objective and potentially continuous identification is needed to improve clinical care. We developed and validated an analytic framework for delirium classification based on automatically extracted video features. In this prospective cohort study, patients ([≥] 18 years) admitted to the inpatient medical or neurological ward of a tertiary academic center between August 2020 and March 2022 with an expected stay longer than one night were enrolled. Daily structured delirium assessments and brief video recordings were performed in consenting patients. Videos were analyzed using deep learning pose estimation to extract keypoints and calculate behavioral features based on eye, face, and limb postures and movements. Four machine learning models (logistic regression, gradient boosting, support vector machines, and random forests) were trained to predict delirium status from extracted features. Model performance was evaluated on 20 repetitions of three-fold cross-validation using the area under the curve of the receiver operating characteristics curve (AUC ROC). The cohort included 109 videos from 25 male and 25 female participants (median age: 72, IQR: 63.25-78). Twenty videos (18%) were from patients with delirium. Keypoints for this dataset were more accurately extracted using a customized ResNet-101 model developed with DeepLabCut (sensitivity 0.94, specificity 0.89, compared to human-labeled gold standards) than using off-the-shelf models. Keypoints were then used to generate behavioral features summarizing movement and postures throughout the video. A support vector machine model achieved an average delirium classification AUC ROC of 0.79 (SD {+/-} 0.09), sensitivity of 0.71 (SD {+/-} 0.16), and specificity of 0.78 (SD {+/-} 0.07). This study demonstrates the feasibility of identifying delirium using brief videos in clinically heterogeneous cohorts and reveals novel features for objective identification. Author SummaryDelirium is a sudden change in attention and awareness that commonly affects hospitalized patients. It is linked with longer hospital stays, cognitive decline, and death. Patients with delirium often show changes in movements and behaviors such as slowed movement, restlessness, or excessive scanning of the environment. Since current screening tools rely on intermittent human interactions, they can be subjective and miss the fluctuating nature of delirium, leading to underdiagnosis. We sought to explore whether short video recordings could be used to detect delirium automatically. In our study, we enrolled 50 hospitalized patients and conducted daily delirium assessments and video recordings. We used a machine learning model to analyze patients eye movements, facial expressions, and body postures. We found that video-derived features could be used to identify delirium in a small clinical cohort. While needing further validation in outside cohorts, this study shows an important proof-of-concept for objective delirium monitoring in heterogeneous clinical contexts without adding burden to clinical staff.

18

Temporal Relationships between Smartphone Application Use and Online Substance Procurement in U.S. Youth

Gansner, M.; Adams, M.; Nikam, P.; Huntley, N.; Ramrajesh, S.; Marsch, L. A.; Levy, S.; Schuman-Olivier, Z.

2026-05-19 pediatrics 10.64898/2026.05.15.26353324 medRxiv

Top 0.6%

0.9%

Show abstract

Background: Despite the significant risks associated with online substance procurement (SP), few researchers have examined this practice in U.S. youth. The studies that do exist are cross-sectional and cannot temporally connect specific digital behaviors to online SP. This longitudinal cohort study examined youth SP and digital media habits to determine whether use of certain smartphone applications correlated with increased odds of online SP or being contacted online about procuring drugs or alcohol. Methods: A cohort of U.S. youth (aged 15-20) with a history of non-daily substance use in the 3 months prior to enrollment was recruited to use the digital phenotyping smartphone application EARS for 90 days. On a nightly basis, participants were asked to complete surveys about online experiences related to SP and instances of substance use. Smartphone-generated screen use data were also collected passively each day. Results: Out of 112 enrolled participants, 106 were able to be included in analyses. Over approximately 3 months, 28.3% of participants (n=30) reported a collective 91 instances where they used social media to acquire drugs or alcohol. Screen use data demonstrated temporal relationships between social media SP and applications previously connected to the social media drug-purchasing process (e.g., TikTok, encrypted apps), as well as other school-specific social media. Discussion: Our results provide critically needed research evidence to support a body of literature composed predominantly of anecdotal reports. Despite measures taken by social media companies to prevent use of their platforms for drug procurement, underage youth continue to engage in this practice.

19

Before Birth, Beyond Childhood: Understanding the Influence of Prenatal Substance Exposure on Psychiatric Diagnoses

Houghton, A.; Caola, L.; Dastin-Van Rijn, E.; Anderson, S.; Kummerfeld, E.; Sullivan, C.; Simpson, S.; Kalkar, A.; Banerjee, R.; Fiecas, M.; Randolph, A.

2026-05-29 pediatrics 10.64898/2026.05.27.26354275 medRxiv

Top 0.6%

0.9%

Show abstract

Background: Prenatal substance exposure (PSE) occurs when an individual is exposed to substances in utero. PSEs may have lasting effects on mental health. We tested whether PSEs show threshold, cumulative, or individual substance associations with childhood psychiatric diagnoses. Methods: Clinical variables (demographics, ICD-9/10 diagnoses, PSE history) were extracted from electronic health records from the University of Minnesota Adoption Medicine Clinic. PSEs were identified from caregiver and child-protective-services narratives and/or toxicology (cord tissue/blood, meconium). For each ICD-9/10 diagnostic category, we fit logistic regression models comparing (1) exposure thresholds (0, 1, 2, 3, 4+ exposures), (2) a cumulative exposure count, and (3) individual substances to estimate marginal odds ratios (ORs) with 95% Confidence Intervals (CIs). Results: Psychiatric diagnoses increased with the number of PSEs. Relative to no exposure, odds of an Anxiety Disorder rose from OR 1.47 (95% CI 1.16-1.87) with one exposure to OR 2.03 (1.64-2.52) with >=4 exposures. Higher cumulative exposure scores were associated with Anxiety Disorders (OR 1.28, 1.18-1.38), Behavioral and Emotional Disorders (OR 1.42, 1.31-1.54), Substance Use Disorders (OR 1.52, 1.29-1.79), and Mood Disorders (OR 1.16, 1.04-1.30). Alcohol, tobacco, and marijuana exposures were associated with increased odds of at least one psychiatric diagnosis, and each substance showed at least one significant diagnostic cluster when modeled independently. Conclusion: Increasing numbers of PSEs were associated with higher odds of psychiatric diagnoses, with patterns varying by substance and outcome. These findings motivate research on exposure timing and combinations to support earlier identification and intervention for at-risk children.

20

BioMADE: Predicting Torsades de Pointes from molecular structures through biologically informed representations

Acitores Cortina, J. M.; Schut, M. C.; Tatonetti, N. P.

2026-05-11 bioinformatics 10.64898/2026.05.06.723121 medRxiv

Top 0.7%

0.9%

Show abstract

Drug-induced arrhythmias, particularly Torsades de Pointes (TdP), pose a significant risk to patient safety and can sometimes have life-threatening outcomes. They remain a major concern in drug development and regulation. Machine learning (ML) has become a powerful tool for analyzing complex biological and chemical datasets, enabling researchers to identify subtle patterns that differentiate safe compounds from those likely to cause dangerous cardiac effects. However, most existing in silico approaches do not sufficiently incorporate biological elements, relying heavily on chemical and structural properties or on computationally expensive simulations. Here, we introduce BioMADE, a novel ML framework that harnesses small-molecule-protein activity profiles from publicly available datasets to predict TdP risk without requiring exhaustive mechanistic annotation. Activity data from ChEMBL were used to train individual models for each gene, which predict activity values for any given compound. A curated set of arrhythmia-relevant genes was then used to construct a latent biological embedding (BioMADE embedding) for each molecule. We validated the performance of these features in distinguishing biological elements such as ATC3 class, showing superior classification performance compared with representations such as Molformer (lacks biological information) and MACCS (limited chemical properties) (0.85 AUROC vs 0.81 and 0.73, respectively). BioMADE representations served as input to a support vector machine classifier to discriminate TdP-inducing drugs from safe compounds. BioMADE achieved an AUROC of 0.89 in internal validation, indicating strong predictive performance. Against state-of-the-art models such as ADMEThyst, BioMADE achieved an AUROC of 0.74 on ADMEThysts validation set (vs. 0.72 for ADMEThyst). When we combined both approaches, the AUROC reached 0.77. These results demonstrate that BioMADE provides a scalable, biology-informed, and generalizable approach for predicting drug-induced toxicities. By integrating protein activity profiles into toxicology modeling, our framework highlights the critical role of human biology in adverse drug reaction prediction, an aspect often overshadowed by purely chemical or structural descriptors.